# Cross-modal Reasoning

Ristretto is a vision-language model that uses dynamic image token deployment, adjusting the number of image tokens to the task's requirements; it is described as outperforming previous generations in both performance and versatility.

Transformers Supports Multiple Languages

Chattime 1 7B Chat

ChatTime is a multimodal foundation model that unifies time series and text processing, featuring zero-shot forecasting capabilities and supporting dual-modal input/output for both time series and text.

Multimodal Fusion

ChemVLM is a multimodal large language model focused on applications in the chemical field, combining text and image processing capabilities.

Meta Chameleon is a hybrid-modality early-fusion foundational model developed by FAIR, supporting multimodal processing of images and text.

Multimodal Fusion

CogVLM is a powerful open-source vision-language model that achieves leading performance on multiple cross-modal benchmarks.

Transformers English

Pix2struct Infographics Vqa Large

Pix2Struct is an image encoder-text decoder model trained through multi-task learning for visual-language understanding tasks, specifically optimized for visual question answering on high-resolution infographics.

Transformers Supports Multiple Languages

© 2025 AIbase